Perturbation of interaction networks for application to cancer therapy.

We present a computational approach for studying the effect of potential drug combinations on the protein networks associated with tumor cells. The majority of therapeutics are designed to target single proteins, yet most diseased states are characterized by a combination of many interacting genes and proteins. Using the topology of protein-protein interaction networks, our methods can explicitly model the possible synergistic effect of targeting multiple proteins using drug combinations in different cancer types. The methodology can be conceptually split into two distinct stages. Firstly, we integrate protein interaction and gene expression data to develop network representations of different tissue types and cancer types. Secondly, we model network perturbations to search for target combinations which cause significant damage to a relevant cancer network but only minimal damage to an equivalent normal network. We have developed sets of predicted target and drug combinations for multiple cancer types, which are validated using known cancer and drug associations, and are currently in experimental testing for prostate cancer. Our methods also revealed significant bias in curated interaction data sources towards targets with associations compared with high-throughput data sources from model organisms. The approach developed can potentially be applied to many other diseased cell types.


Aims and motivation
Research into the genetic basis of many different diseases has developed our knowledge of the links between particular genes, proteins and diseases, such as the wellknown P53 gene and cancer. Most diseased states are polygenic, however, and cannot be explained or characterized by a single gene, but rather by a combination of interacting genes and their products (Vogelstein et al. 2000). Therapeutic drugs traditionally target single highly connected proteins in the networks associated with diseased cells, in order to elicit a response. Some degree of effi cacy is possible using such an approach, but a knowledge and characterization of the gene and protein interaction networks associated with a diseased cell state is important for the development of improved therapeutics. For example, highly connected proteins are more likely to have a critical role in the protein interaction networks of normal cells, and hence side effects and chemotoxicity will often result with such an approach (Keyomarsi and Pardee, 2003). In addition to disease therapy, a network level approach is also required to understand many other phenotypic processes, such as development (Gilbert, 2000).
We present an approach which characterizes cancer types on a network level, by developing a model of the interaction networks present in tumor cells. The primary motivation for using such an aproach is to search for potential combinations of drugs which give improved effi cacy for cancer treatments compared to existing therapies typically involving a single drug. The potential improved effi cacy from using a multi-drug approach to cancer therapy has already been recognized (Mitchell, 2003), but research has not yet attempted to discover novel drug combinations from a knowledge of the underlying associated networks. For example, if a given drug is known to inhibit tumor growth to some extent, we can computationally search for another drug to be used in combination with the known drug to give improved effi cacy. Part of the premise of such an approach is that network topology is a useful predictor for cancer and diseased targets in general. Network topology has been Quayle et al shown to infl uence many biological properties including gene essentiality (Jeong et al. 2001;Yu et al. 2004), expression (Lukashin et al. 2003;Herrgard et al. 2003) and function (Dunn et al. 2005;Guimerà and Amaral, 2005) amongst others, and therefore some infl uence or correlation with target suitability seems likely. Yet, to our knowledge this correlation has not yet been studied in any detail.
We develop models of the networks associated with tumor cells and equivalent normal cells in H. sapiens by mapping gene expression comparison data onto networks of protein-protein interactions. The interaction data is collated both from established sources of curated H. sapiens interaction data and from a set of ortholog interactions in model organisms. We search for potential drug combinations using a computational simulation of the effect of removing target protein combinations from the network associated with a given tumor cell (cancer network). A number of alternative measures or ranking methods are introduced to model the criticality of these protein combinations, which are used to predict potential drug combinations by relating the relevant proteins to known drugs. A network level model is a relatively new approach to cancer research but-together with the integration of gene expression data-has already been noted as a useful avenue for drug discovery for cancer therapies (Huang, 1999). Due to current limitations in data coverage and accuracy, in particular for protein interaction data, such a network-level model cannot be expected to provide a complete solution, but may provide signifi cant predictive power.
Interaction network models can be developed on a number of levels, using a global network model such as in the current work, possibly incorporating different network types, or alternatively using more detailed models of sub-networks or individual pathways. One of the ultimate goals of biological network models is to produce an accurate model of the complete network of interactions in a cell or tissue type, often referred to as the interactome (Walhout et al. 2002). In this approach, signalling networks, metabolic networks, and protein networks are all explicitly included in a single model, which is currently only feasible for relatively simple organisms such as E. coli (Juty et al. 2001). Models for more complex organisms focus on particular network types, such as transcriptional networks (Bolouri and Davidson, 2002), or metabolic networks (Kell, 2004;Stelling et al. 2002). Topological network models have shown that many cellular networks exhibit a power-law degree disribution , and that their structure has a modular nature (Ravasz et al. 2002;Rives and Galitski, 2003).

Network perturbation
We use network perturbation to search for novel target combinations, and consider the perturbation of one network relative to another, which we call "preferential perturbation". A single network perturbation approach attempts to maximize the perturbation to a cancer network, whereas an approach involving preferential perturbation in this case attempts to maximize the perturbation to a cancer network, while minimizing the resultant perturbation to a related normal network.
The susceptibility of networks to attack and failure has been studied for many network models and real world networks (Holme et al. 2002). The robustness of real-world networks is critical in a wide-range of contexts, including power-grids and general transport networks, communication networks and the Internet (Holme et al. 2002) (Cohen et al. 2000). Most studies have investigated the robustness of single networks or systems of networks, and have not considered the robustness of one network relative to another. The preferential perturbation of a pair of networks with non-zero similarity is a very different problem to single network perturbation, and we study the principles and topological dependency of preferential network perturbation in a separate publication (Quayle et al. 2006). We show that the extent of preferential perturbation of a pair of networks depends on a number of topological parameters, including the network similarity, size and average degree amongst others. The work in (Quayle et al. 2006) uses random (ER model) networks (Erdös and Rényi, 1960) and so called "scale-free" (BA model) networks (Barabási and Albert, 1999) to develop results and principles for general networks.

Model and Methods
We describe the methods and approach used in detail in Sections 2.1 to 2.3, in three conceptually distinct and sequential phases. The first phase is the collation of interaction data (Section Perturbation of Interaction Networks 2.1), which generates underlying network models. Gene expression data is used to generate networks associated with specific cell types (Section 2.2) and the final phase involves the simulation of network perturbations to study these networks (Section 2.3). We also describe the disease and drug association data used (Section 2.4) and details of statistical techniques used for data analysis (Section 2.5).

Protein-protein interaction data
We combine a number of interaction data sources to improve coverage over that which could be acheived using only one such data source. We collated and populated curated H. sapiens interaction data from BIND (Bader et al. 2003), HPRD (Peri et al. 2003) and DIP (Xenarios et al. 2002) to give a total of 9,020 interactions between 4,524 HUGO genes (Wain et al. 2004). Since the interaction data currently available for H. sapiens is relatively limited in terms of coverage, we also added a set of predicted interactions in H. sapiens which was derived from interactions in model organisms by Lehner and Fraser (Lehner and Fraser, 2004). By mapping proteins referenced in interaction data from the model organisms C. elegans, D. melanogaster and S. cerevisiae to orthologous proteins in H. sapiens using the Inparanoid algorithm (O'Brien et al. 2005), a set of predicted interactions in H. sapiens was obtained. We extracted the core dataset of interactions which have the highest associated confi dence score according to the Inparanoid mapping algorithm, to give 6,958 unique interactions. The majority of results presented here are based only on the higher confi dence curated data sources, but the set of predicted interactions gives improved coverage.

Generation of networks associated with cell types
The combination of data sources described in Section 2.1 generates a network which is a subset of the true network of all possible interactions due to data limitations, and is also subject to the accuracy of interaction data. The particular proteins and interactions present in a cell varies signifi cantly between cell types, depending on the gene and protein expression. A knowledge of the gene expression profi le for a given cell type can be used to generate a model of the associated interaction network. We used expression comparisons of SAGE libraries (Blackshaw et al. 2003) from a particular type of tumor cell and equivalent normal cell to determine those genes which are up-regulated or down-regulated, according to a signifi cance level of 5% and at least a fi ve-fold change in the expression level (after conversion to a parts per million measure).
A tag to gene mapping stage is also required, for which we used the mappings available from SAGE Genie (Boon et al. 2002). Approximately 80% of the tags were successfully mapped to genes, which is a reasonable fraction for tag mapping. A more detailed discussion of alternative approaches and complications involved in tag to gene mapping is described in Pleasance and Jones (Pleasance and Jones, 2005). We map genes that are up-regulated in the cancer library to the cancer network, and down-regulated genes to the normal network, and the non-differentially regulated genes are mapped to both networks. This second phase gives two sub-networks from an underlying network to represent the networks associated with particular types of tumor and normal cells. In this phase of the method, approximately 50% of the genes map to the underlying interaction network (see Section 3.2 for more detailed discussion on data coverage).
Networks for seven different cancer types are generated from expression comparisons of short SAGE libraries available from the Cancer Genome Anatomy Project (CGAP) (Lash et al. 2000), as listed in Table 1.

Network perturbation
In the current work we use a topological characterization of the networks generated from the techniques described in Sections 2.1 and 2.2, as the fi rst phase towards the development of a more detailed understanding and model of these networks. We study the change in network topology when a combination of proteins (network vertices, which we use to refer to proteins later in the paper) is removed, which we call a network perturbation.We search for optimum target combinations which, when removed, maximize the fragmentation of a given cancer network, while minimizing the fragmentation of the associated normal network. It is hoped that this may lead to novel drug combinations, which can target particular proteins, and selectively perturb the cancer network. We measure network perturbation by the size of the giant component, which is the largest component in a network. This is a standard approach used in network robustness studies (Holme et al. 2002;Albert et al. 2000). We defi ne the perturbation score for a single network as, where G 1b and G 1a are the giant component sizes of network 1 before and after perturbation respectively (Quayle et al. 2006). Equation 1 includes normalization based on the initial network, such that the perturbation score ranges between zero and one. We defi ne a preferential perturbation as the difference in perturbation scores for two networks such that, where P 12 reflects a preferential perturbation between two networks. An exhaustive search for the optimum vertex combination which maximizes the perturbation score is not possible due to the size of the associated search space, and a number of alternative optimization approaches were considered. It was found that a technique using successive vertex removal gives the highest score on average (Quayle et al. 2006). That is, the highest ranking vertex is initially removed, parameters are recalculated based on the new networks, the new highest ranking vertex is removed, and this process is repeated iteratively until the required number of vertices have been removed to make up a vertex combination.
Alternative vertex ranking methods were studied, based on parameters such as vertex degree, diffi erence in degree, betweenness, diffierence in betweenness, and the resulting perturbation score (see (Quayle et al. 2006) for further explanation of these methods and a detailed study of their success). It was found that no single method is a universally optimal method, but rather the best method depends on the regime of interest (number or fraction of vertices to remove) and also details of network topology. Therefore we present and compare results from all methods considered for the perturbation of cancer and normal networks.

Target association data
Given a novel target combination we wish to determine whether those targets are druggable or disease associated, and in particular whether they are cancer associated. We used the Therapeutic Target Database (TTD) as a source of target-drug associations (Chen et al. 2002), which includes information on protein and nucleic acid targets, the diseased target, corresponding drugs or ligands and pathway information. After mapping to HUGO IDs we obtained 659 targets, 231 of which have drug associations, and 641 with disease associations at the time of population. Many of the listed disease associations are cancer associations, and many targets are associated with multiple cancers. An additional set of 299 cancer associations was obtained from a literature census of mutated genes implicated in human cancer (Futreal et al. 2004), giving 275 associations after mapping to HUGO IDs.

Correlation tests
We studied the correlation of vertex properties between networks which provides a number of insights, but is principally useful for quantifying the infl uence of the underlying network on the resulting cancer and normal networks. The types of analysis are based either on initial network topology or perturbation results, for which we use standard linear regression (Section 2.5.1) or Spearman's rank correlation coeffi cient (Section 2.5.2) respectively. Perturbation of Interaction Networks 2.5.1. Initial topology Correlations between vertex topological properties have been studied for model networks (Holme et al. 2002;Quayle et al. 2006), and a similar analysis can be made for the cancer and normal networks, and the underlying network. For example, the correlation between vertex degree and betweenness for a given network is visualized by a scatter plot, where the x and y values of each point are the degree and betweenness of a vertex. A least-squares linear regression coeffi cient is used to quantify the strength of the correlation. We calculate the linear correlation coeffi cients between different cancer networks and between the underlying network and cancer networks.

Perturbation results
The output from a perturbation simulation is an ordered vertex list, that is, a set of vertex ranks. The correlation between the rank positions of vertices in two different perturbations can be calculated using Spearman's rank correlation coefficient which is suitable specifically for ranked data. The output from two perturbations typically contains different vertex sets and a different number of vertices, so the coeffi cient is calculated based only on vertices in both output lists. Spearman's rank correlation coeffi cient, r s is given by the expression, where n is the number of paired ranks, and d i is the diffi erence between the two ranks in a pair (in this case for a given vertex) (Zar, 1999). Note that the above expression is strictly only valid for data with no tied ranks, which is true for the perturbation results since only one vertex is removed at a time.

Protein-protein interaction data
The underlying network model constructed from the combination of data sources was analyzed by calculating a range of network properties. Table 2 shows relevant statistics for the data sources and data source combinations, where we use the network terminology vertices and edges to refer to proteins and protein-protein interactions respectively. A number of network parameters are shown in Table 2 which we explain in more detail below, since these parameters are discussed in later sections of this paper. The degree of a vertex is the number of edges connected to that vertex, and the average degree is the vertex degree averaged across all vertices in a network. Similarly, there is a shortest path length between every pair of vertices in a network, and the path length is averaged over all vertex pairs to give the average shortest path. The betweenness of a vertex (Joy et al. 2005), which is discussed later, measures the number of shortest paths passing through a given vertex.
The clustering coeffi cient measures the fraction of transitive triples or triangles between nearest neighbours, and the clustering coeffi cient of vertex i, C i , is given by, where k i is the degree or number of nearest neighbours of vertex i, and E i is the number of edges connecting between these nearest neighbours (Watts and Strogatz, 1998;Wasserman and Faust, 1994). The clustering coeffi cient of a network is calculated by averaging C i over all vertices of the network. Finally the assortativity coefficient measures the tendency of vertices to connect to other vertices which have a similar degree (see Table 2. Data statistics: Topological properties of data sources and data combinations where C1 = BIND + HPRD + DIP and C2 = BIND + HPRD + DIP + Predicted.  (Newman, 2002) for an exact defi nition). In other words, if the "hubs" in a network tend to connect to other hubs, then the network shows assortative mixing. The assortativity coeffi cient is defi ned between minus one for a perfectly disassortative network and plus one for a perfectly assortative network, where a value of zero means there is no assortative mixing. It has been shown that social networks are generally assortative, whereas biological and technological networks are generally disassortative (Newman, 2002). We group the data sources into two data source combinations, where the fi rst is a combination of primary H. sapiens interaction data, and the second combination contains all data sources including predicted interactions. DIP has the least interactions and the smallest average degree, and correspondingly the largest average shortest path. The data source of predicted interactions is the most connected, and has a surprisingly high assortativity coeffi cient of r = 0.446, since biological networks are generally disassortative. This likely refl ects sample bias in this dataset of core interactions, since each data source is only a subset of the complete set of interactions, and is subject to bias in terms of known interactions. This bias and the fact that data derived from orthologous proteinprotein interactions is expected to be less accurate than primary H. sapiens interactions are the motivation for using two alternative data source combinations.

Topology and similarity of cancer and normal network pairs
We use the defi nition of network similarity, S, in terms of network edges given by, where n c is the number of edges in common between the two networks, and n t is the total number of edges in the combined network (see (Quayle et al. 2006) for further explanation). At the two extremes, if the networks are identical then no preferential perturbation is possible, but if the networks have no similarity then a complete perturbation of one network relative to the other is possible if suffi cient vertices are removed, and the problem is equivalent to single network perturbation. Between these limits, some maximum possible preferential perturbation score is associated with two given networks. A detailed derivation of the similarity of both independent and "correlated" network pairs has been determined (Quayle et al. 2006). Correlated networks are derived from a single underlying network, and the cancer and normal network pairs are correlated network pairs, since they are derived from the underlying interaction network. The similarity of such correlated networks is independent of network topology if the vertices which make up the correlated networks are sampled randomly from the underlying network. The similarity of randomly sampled correlated networks varies directly with the vertex set similarity, V S , which is given by, where N C is the number of vertices common to the two networks, and N T is the total number of distint vertices in the networks. For random vertex sampling, we showed by analytical derivation that the expected network similarity, E(S), is equal to the square of the vertex set similarity.
Fourteen cancer and normal network pairs were generated from the seven cancer types listed in Section 2.2 and the two data source combinations described in Section 3.1. After tag mapping, the results from an expression comparison contain around 7,000 genes, which is typically reduced to around 2,000 after singletons-a SAGE tag with only one occurrence in the corresponding library-are removed. There is signifi cant variation in this stage, however, and some cancer types such as prostate have far fewer singletons, leaving around 4,000 genes after the removal of singletons. Approximately 50% of these genes map to the underlying network, giving network sizes of the order of 1,000 genes. The similarity of the generated network pairs was calculated and plotted against the vertex set similarity for each network pair, as shown in Figure 1. The variation of the expected network similarity for a general network pair with random vertex sampling is also shown. Figure 1 shows that the network similarity and vertex set similarity values for the cancer and Perturbation of Interaction Networks normal networks are generally close to the predicted variation for random vertex sampling. Network pairs for different cancer types have network similarity values within the approximate range 0.55 ≤ S ≤ 0.70, with the exception of prostate cancer, where S ∼ 0.90. Therefore the maximum preferential perturbation score is likely to be lower for the prostate cancer networks, depending also on other topological parameters such as networksize, average degree and clustering coeffi cient, which all infl uence network robustness to some extent (Quayle et al. 2006). Table 3 show the values of such parameters for selected networks, which is a sample from a larger table in the Supplementary Materials section.

Vertex ranking methods
We study the preferential perturbation of cancer and normal networks using the vertex ranking methods described in Section 2.3. We are primarily interested in removing combinations of only a few vertices (proteins) to give novel target combinations, which corresponds to a small fraction of the total number of vertices in a given network. Successive vertex removal is applied until a network is fully fragmented and all vertices are removed, which provides useful additional network characterization. We refer to the variation of the perturbation score with the number of vertices removed as a "perturbation profile". Figure 2 shows preferential perturbation profi les averaged across the fourteen cancer and normal network pairs based on the number of vertices removed. Notably a preferential perturbation profi le generally increases up to a maximum, and if more vertices are removed beyond this point the score decreases, as the normal network becomes more fragmented.
As predicted from results developed in (Quayle et al. 2006), Figure 2 shows that no single vertex ranking method is universally optimal, but rather the best method depends on the regime of interest and the defi nition or metric of method success. Alternative metrics of success are discussed further below, but often the extent of perturbation or Table 3. Topological parameters of selected networks, where the network name refl ects the data combination (C1 or C2), cancer type, and whether the network is a cancer or normal network.

Network
Number perturbation score may be the most suitable metric. According to this metric, for small N rm the perturbation scores method is the most effective, and betweenness-based methods are generally more effective than degree-based methods.
For ER and BA model networks degree-based methods are in fact more effective than betweennessbased methods in this regime. The cancer and normal networks are signifi cantly more clustered than equivalent ER and BA model networks of the same size and average degree, as shown in Table 4, which is partly responsible for the observed difference in effectiveness of these methods. The observed clustering fi ts results from previous studies which have shown that many biological networks have an inherent modularity and clustering (Ravasz et al. 2002;Girvan and Newman, 2002).
The success of vertex ranking methods relates closely to whether or not they provide an accurate measure of the "centrality" of vertices in a network. As discussed, degree and betweenness are alternative measures of vertex centrality, and for most network topologies betweenness-based methods are more effective (Holme et al. 2002;Quayle et al. 2006). This is not surprising, since betweenness is a global measure (requires knowledge of network structure) whereas degree is only a local measure (detailed network structure is not needed to measure the degree of a given vertex). The perturbation scores method is a relatively poor measure of vertex centrality, since this method effectively forces at least some perturbation for each vertex removal, and hence tends to target more peripheral vertices. Therefore, although this method on average gives the greatest preferential perturbation score for small N rm , this may not be the most effective predictor of useful targets.
Alternative metrics for analyzing method effectiveness are the score at the maximum in a profi le, and the average perturbation gradient, which is given by, where max represents the values of these parameters at the maximum in a given profile (see (Quayle et al. 2006) for the motivation behind these parameters). When searching for novel target combinations, we focus on combinations of up to 5 targets, in other words the regime for small N rm , below the maximum. The current static topological model gives a conservative estimate of the fragmentation of a real network, since it cannot   Perturbation of Interaction Networks take into account dynamical effects such as possible cascades and subtle dependencies between interactions, which may be observed in the real biological network. Therefore, alternative metrics based on the maximum in a profi le may be more realistic, or at least equally effective predictors of the effect of perturbations on a network, rather than simply the score at a given N rm value. Tables 5 and 6 show values for these metrics averaged across the fourteen network pairs for each vertex ranking method. Vertex betweenness is signifi cantly the most effective method according to both metrics, which cannot be seen in the averaged profi les shown in Figure 2 since maxima occur at different N rm values in different networks. These results show a similar order of method effectiveness as predicted (Quayle et al. 2006), and suggest that the betweenness method may provide the most useful target predictions.
Target results for prostate cancer networks for combinations of up to 5 targets are shown in Table 7, for which we are testing the predictions in our own prostate cancer research group. Although betweenness is the most effective method according to the two metrics as shown in Tables 5 and 6, the fi rst few vertex removals using this method actually give a negative perturbation score for this particular network pair (i.e. the normal network is more fragmented than the cancer network). On the other hand, diffierence in betweenness generates a comparable score to that generated using the perturbation scores method.
The results shown in Figure 2 and Tables 5 to 7 give strong evidence that degree-based methods are less useful than betweenness-based methods. Furthermore, betweenness and difference in betweenness are more effective measures of vertex centrality than degree and difference in degree respectively. What is less clear from the above analysis is exactly which metric of method success is most appropriate, given the current model. Therefore we are initially investigating highly ranked targets from the betweenness, diffi erence in betweenness and perturbation scores in experimental testing.

Correlation tests
Targets with a high betweenness in a given cancer network also tend to have a high betweenness in the underlying network from which the cancer network was derived. This has the result that highly ranked targets according to the betweenness method for a given cancer type are often highly ranked targets in another cancer type, due to the correlation of both cancer networks with the underlying network. The difference in betweenness ranking method selects targets which are less strongly correlated with the topology of the underlying network, and thus the highly ranked targets tend to be more specifi c to a given cancer type. It is possible that both approaches may predict useful novel targets, since some known cancer targets are highly specifi c to a given cancer type (Fukazawa et al. 2004), while others are known to provide a therapeutic response in many cancer types (Shelton et al. 2005).
To analyze the strength of these correlations between the underlying network and the cancer and normal networks, and between different cancer types, we used correlation tests as described in detail in Section 2.5.1. Table 8 shows the vertex betweenness correlation between the underlying networks and cancer networks. Typical correlation coeffi cients are within the approximate range 0.8 < r < 0.9, indicating a signifi cant correlation in this case. Correlations between vertex degree are slightly stronger between a cancer network and the underlying network from which it was derived, but weaker between a cancer network and a different underlying network (see Supplementary Materials). This is an interesting result which shows that betweenness is less sensitive to the  addition or removal of links (between different underlying networks) than degree, and that vertex betweenness captures the overall or core topology of a network more effectively than vertex degree. This inference is consistent with the fact that betweenness provides a better measure of vertex centrality than degree. Since the current underlying networks are only subnetworks of the true or complete interaction network, this result indicates that such a subnetwork can provide a useful representation of the "core" topology of the true interaction network. It is expected that the observed correlations with initial network topology extend to the vertices (targets) selected in perturbations in different cancer types, since perturbation ranking methods are based on betweenness and degree. We investigated the vertex correlations between perturbation results for different cancer types using Spearman's rank correlation coeffi cient, as described in detail in Section 2.5.2. Correlation coeffi cients were calculated between each cancer network or cancer and normal network pair, for each vertex ranking method. The Supplementary Materials contains detailed results for the betweenness and difference in betweenness ranking methods, and Table 9 shows the average rank correlation coeffi cient for each vertex ranking method.
The average correlation coeffi cients for the betweenness and degree ranking methods are signifi cantly higher than those from other methods, since these methods do not take into account the relevant normal network topology. Therefore a target with a high rank in the perturbation results using the betweenness method in a given cancer type, is likely to also have a high rank in the results for another cancer type, if it is present in the network (recall the correlations are only based on common targets). There is less correlation in the perturbation results using other ranking methods, and the diffi erence in betweenness ranking method gives the least correlation, refl ecting the sensitivity of this method to topological differences. Correlations between perturbation results were also compared between the underlying network and the cancer networks, giving similar results for different ranking methods. Table 7. Successive targets obtained using alternative ranking methods and the associated perturbation score for the preferential perturbation of prostate networks of data combination type C1.

Target associations
The outcome of the current work is a set of ranked targets for a given cancer type, which we are testing for their potential as targets for cancer therapy. An approach for assessing the likelihood that these targets are useful, potentially novel cancer targets is to validate the target ranks against known cancer targets. If our methods preferentially select for known cancer targets, this indicates that highly ranked novel targets have a greater likelihood of being useful cancer targets. We applied a global statistical analysis on the ranks of targets with known cancer associations, disease associations and drug associations, using the association data described in Section 2.4. The null hypothesis is that targets with a specifi ed association type are selected randomly from a given network, and the alternative hypothesis is that such targets have a greater chance of selection by our methods, or in other words are more highly ranked than at random. The analysis generates a p-value using hypergeometric probability distributions for each cancer and normal network pair for a given association type and ranking method. These p-values are then combined using the Z-transform test (Whitlock, 2005) to give an overall p-value for each ranking method, for a given association type. The combined p-value therefore measures the ability of a given ranking method to preferentially select for cancer targets, for example. The resulting p-values are given in Table 10, which shows that according to the null hypothesis the betweenness, degree and perturbation scoring methods all strongly select for cancer and disease associated targets. This seems to validate the effectiveness of these methods, but the null hypothesis is naive, since it assumes independence between tests and does not account for possible biases or network correlations as calculated in Section 3.3.2. As discussed, we expect some biases in the underlying interaction data for known cancer targets, since these targets are known to be therapeutically interesting. These biases will tend to increase the degree and betweenness of targets with associations in the underlying networks and cancer and normal networks. We therefore calculate the average degree and betweenness of targets with associations in these networks, and compare them with the average degree and betweenness of all targets in the networks. Tables 11 and 12 show ratio values for the average betweenness for different association types.
The ratio values are consistently greater than 1, which shows that targets with associations have signifi cantly greater betweenness than an average target, both in the cancer networks and the underlying networks. The average value of the equivalent ratios for the average degree are typically around 1.5 (see Supplementary Materials), which is less than the ratios for average betweenness. The betweenness distributions of these networks have a larger spread or standard deviation than that of equivalent degree distributions, which in part explains this difference.
Since targets with associations tend to have a higher than average betweenness and degree in the underlying curated networks, much of the significance observed in Table 10 is due to the infl uence of the underlying network, rather than the cancer specific networks. This result is biologically reasonable, and is potentially very useful for the discovery of cancer targets with therapeutic applications in many cancer types. For example, the well-characterized EGFR target is highly ranked from our methods in many cancer types, and is known to be implicated in many cancer types including prostate cancer (Shelton et al. 2005), where it is the fi fth highest ranked target using the betweenness ranking method. Many other key cancer targets are highly-ranked using our methods in different cancer types, such as SMAD3, MAPK3, RAF1 and TP53, amongst others. In order to quantify the bias in these curated datasets to some extent, we ran a similar analysis using networks generated only from highthroughput data. All methods of network generation and analysis are as described previously except the underlying network is now the complete set of interactions predicted by Lehner & Fraser (Lehner and Fraser, 2004). This high throughput data is biased, but the types of bias are different to those expected in curated data, and therefore such an analysis is useful (Mrowka et al. 2001;Serebriiskii et al. 2000). Tables 13 and 14 show equivalent results for networks generated from high-throughput data to the results in Tables 11 and 12 for curated interaction data (see Supplementary Materials for degree ratios).
The betweenness ratios are much lower for networks generated from high-throughput data than from curated data, which indicates that much of the signifi cance is likely to be due to biases in the data, and gives a clear demonstration of bias in curated interaction data sources. Interestingly, the betweenness and degree ratios are still significant for targets with drug associations, which shows how drugs have typically been developed for relatively "central" targets with high degree and betweenness. We also ran perturbations of networks generated from high-throughput data, and applied the same analysis as described for curated data sources to generate Table 10. Equivalent results from high-throughput data sources are shown in Table 15. In this case, none of the association types are signifi cantly selected, with the exception of drug associations, which are selected by the betweenness and degree methods. These results are consistent with the differences between network properties of targets with associations for curated and high-throughput data as shown in Tables 11 to 14.
The differences in results for curated and highthroughput interaction data sources highlight some of the current problems with interaction data, where curated sources contain signifi cant biases towards well studied targets of interest, and highthroughput sources contain a high percentage of false positives. As more high-quality interaction  data becomes available, a network-level approach will become increasingly important for target and drug discovery. Given the current data limitations, however, our predictions for novel target combinations have returned many targets known to be important in the relevant cancer types, and we are pursuing these predictions for their applicability in cancer therapy. We have also run perturbations where only targets with known drug associations are selected so that the results can be tested using readily available drug combinations. Many of the highest ranked targets with drug associations are also known cancer targets.

Summary
In summary, we have developed and studied a novel method for predicting target and drug combinations based on network topology in multiple cancer types. By simulating network fragmentation from targeting multiple proteins in a given cancer type, such a network-level approach facilitates a search for novel target combinations. Our methods signifi cantly select for cancer associated targets using curated interaction data sources, and return many targets of interest in cancer therapy. When using predicted H. sapiens interaction data generated from high-throughput model organism data sets, the methods do not signifi cantly select for known cancer associations. The difference in results between networks generated from curated data sources and high-throughput data sources reveals signifi cant bias in curated data towards targets of interest, with known associations. We have predicted sets of target and drug combinations in seven different cancer types, which are currently in experimental testing for prostate cancer using both drug studies and siRNA techniques.    VCL  ITGB1  HLA-C  2  TP53  ITGB1  TP53  HSPA1A  ITGB1  3  TLN1  RAF1  TLN1  COL1A1  RAF1  4  SMAD3  TP53  SMAD3  RAF1  SMAD3  5  CTNNB1  PIK3R1  RNF11  VCL  HD  6  PIK3R1  HSPA1A  HD  COL1A2  RAN  7  RAF1  SMAD3  CTNNB1  NFKBIA  HSPA1A  8  RNF11  KIT  APP  XPO1  PTK2B  9  HSF1  TLN1  ITGB1  TGFBR2  ITGA5  10 HD COL1A1 E2F4 RAN LSM1  Table 12. Comparison of perturbation results (target rankings) using Spearman's rank correlation coeffi cient based on the betweenness method. Only the common targets are compared, and the ranks are relative to this target set. Network   Table 13. Comparison of perturbation results (target rankings) using Spearman's rank correlation coeffi cient based on the difference in betweenness method. Only the common targets are compared, and the ranks are relative to this target set. Network